Sarah Van Oss
Department of Anthropology, School of Liberal Arts, Tulane University
Pulling It TogetheR: Collecting and Collating Quantitative Data from Written
Reports for Analysis using R and ChatGPT
ACKNOWLEDGEMENTS
Thank you to the Proyecto Arqueológico Waka’ for their continued
support and facilitation of this research; the Middle American
Research Institute at Tulane University and Dr. Marcello Canuto for
continued research support.
GOAL: DEVELOP CODING IN R TO
COLLECT DATA FROM TEXTS
•Create an efficient system for data
collection and organization
•Reduce data collection time
•Facilitate collation and comparison of
data across investigations, time, space
•Lay groundwork for answering questions
using large datasets, relational databases
PROBLEM: RETRIEVING DATA FROM
ARCHIVAL OR TEXT DOCUMENTS
Accessing Data from Archival or Text Documents
Much archaeological data is housed in written reports.
Published reports, such as informes used throughout
Mesoamerican archaeology, communicate findings to
governing bodies and the public. Accessing this data for
later analysis, however, can be time-consuming and
difficult.
Collecting Data Manually From Written Documents
When done manually, data collection from written texts
varies greatly and can take days to weeks. This is
particularly true when the investigator is unfamiliar with
the language, concepts, or project.
Creating Comparable, Large Datasets
Data housed in such reports are difficult to compare with
other datasets because of their medium and formatting.
Comparisons across years, investigators, projects, sites,
etc. must be made manually.
Answering Big Questions With Big Data
Research questions in archaeology are limited by the
availability of data. This includes accessibility of the data
record (e.g. written reports) and what data is feasible to
produce through traditional excavation. If data from
written reports can be accessed and understood quickly,
it will lead to better-formulated research questions, more
efficient excavations, and the opportunity to expand
research scope beyond a single investigator’s capacity.
METHOD
1. Compile project report texts into a readable file
I use Excel to create separate observations, or data
entries, for analysis by the R code. To segment written
archaeological data, each observation equates to one
lot—the smallest unit of excavation.
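As a sketch of this first step, the compiled spreadsheet might be read into R with readxl; the file and column names here (e.g., `informe_2023.xlsx`, `texto`) are hypothetical, not the project's actual ones:

```r
# Sketch: load the compiled Excel file of observations, one row per lot.
# File name and column names are illustrative.
library(readxl)

reports <- read_excel("informe_2023.xlsx")
head(reports$texto)   # free-text excavation description for each lot
```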
2. Define the data categories desired for collection
e.g., the primary data needed for this project are artifact
counts, excavation depths, and contextual designations.
3. Identify and collapse text patterns for each data
category
Textual patterns vary in written language. For the code to
collect data correctly, the words and phrases associated
with each data category must be identified. For instance,
“ceramics” might be indicated by
words like “sherds,” “pieces,” “fragments,” etc.
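One way to collapse such a vocabulary into a single pattern that R can match, sketched with an illustrative term list:

```r
# Sketch: combine the words that signal "ceramics" into one regular
# expression. The terms and sample sentence are illustrative.
ceramic_terms   <- c("sherds?", "pieces?", "fragments?")
ceramic_pattern <- paste(ceramic_terms, collapse = "|")

grepl(ceramic_pattern, "Recovered 45 sherds in this lot")  # TRUE
```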
4. Extract quantities associated with each textual
pattern for each observation
Once text patterns that indicate a data category have
been identified, R can recognize those patterns and
extract the quantities associated with each observation.
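A minimal sketch of this extraction step with stringr, assuming counts appear immediately before a ceramic term; the sample observations are invented:

```r
# Sketch: capture the number that precedes a ceramic term in each
# observation. Pattern and example texts are illustrative.
library(stringr)

obs <- c("Recovered 45 sherds and 3 obsidian blades.",
         "No ceramic fragments were observed.")
ceramics <- as.integer(str_match(obs, "(\\d+)\\s+(?:sherds?|fragments?)")[, 2])
ceramics  # 45 for the first lot, NA where no count is given
```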
5. Compile quantities into a database for analysis
Once collected, these data are compiled into a database
for large-scale and comparative analysis.
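The compilation step could be sketched as binding the extracted counts into one table keyed by lot and writing it out; lot identifiers, counts, and the output file name are hypothetical:

```r
# Sketch: assemble extracted quantities into one analyzable table.
# All names and values here are illustrative.
lots <- data.frame(
  lot      = c("Lot-001", "Lot-002"),
  ceramics = c(45L, NA_integer_)
)
write.csv(lots, "report_counts.csv", row.names = FALSE)
```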
RESULTS
Data Collected
The table below was built using the methods described,
“translated” from written reports into an analyzable database.
The code is structured for easy expansion or contraction
based on what data is required to answer relevant
research questions. This example uses only ceramics:
larger databases can include lithic, obsidian, and figurine
counts, or other data like Munsell designations or
elevation measurements.
Creating and Analyzing a Database
After data is collected, it is combined with other report
data and compiled into a larger database for further
analysis. This database could be a simple spreadsheet or
a more elaborate relational database. Project data can
then be efficiently analyzed and visualized, as shown
below (number of lithics per lot).
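A visualization of the kind described (lithic counts per lot) could be sketched with ggplot2; the counts here are invented for illustration:

```r
# Sketch: bar chart of lithics per lot with ggplot2. Counts are made up.
library(ggplot2)

counts <- data.frame(lot     = paste0("Lot-", 1:5),
                     lithics = c(3, 12, 7, 0, 5))
ggplot(counts, aes(x = lot, y = lithics)) +
  geom_col() +
  labs(x = "Excavation lot", y = "Lithic count")
```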
DISCUSSION/CONCLUSION
Once developed, this R program reduces data collection
time by up to 80% compared with manual entry.
High variability in textual patterns requires constant
iteration of the code and consistent data verification.
Some cleaning and manipulation of the text is
required to ensure data accuracy.
Variation across investigators, projects, sites,
excavation methods, etc. is to be expected. These
changes can be accounted for during collection and
data verification.
Future Directions:
1) After collection, data will be collated into larger
databases for large-scale comparison.
2) Future research will explore the use of relational
databases and structured query language (SQL) to
further improve data comparison across a project’s
timeline and among various projects.
3) This method can also encompass spatial data for
analysis in spatial software or a GIS.
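As a hedged sketch of direction 2, the compiled table could be loaded into SQLite through the DBI package for cross-season queries; the database, table, and column names are hypothetical:

```r
# Sketch: push compiled counts into SQLite and query across seasons.
# All names and values here are illustrative.
library(DBI)

con  <- dbConnect(RSQLite::SQLite(), "project.sqlite")
lots <- data.frame(season   = c(2022, 2022, 2023),
                   lot      = c("Lot-001", "Lot-002", "Lot-003"),
                   ceramics = c(45, 12, 30))
dbWriteTable(con, "lots", lots, overwrite = TRUE)
dbGetQuery(con, "SELECT season, SUM(ceramics) AS total FROM lots GROUP BY season")
dbDisconnect(con)
```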
SIGNIFICANCE/BROADER IMPACTS
Implementing this data collection method
allows investigators to address more
expansive research questions through the
creation of larger datasets in less time,
saving investigation funds and labor-hours.
Leveraging automated code is an easily
adoptable solution for managing and
analyzing extensive data collected by long-
term projects or held within archives,
enabling re-engagement with previous
research and the expansion of current
efforts.
Creating code that collects quantitative data
from text documents, and making that code
publicly available, will facilitate greater
access to data held in public sources,
allowing more equitable access to data for
scholars and community-led research initiatives.